Abstract

We study the problem of estimating multiple linear regression equations for the purpose
of both prediction and variable selection. Following recent work on multi-task learning
(Argyriou et al. [2008]), we assume that the regression vectors share the same sparsity
pattern. This means that the set of relevant predictor variables is the same across the different
equations. This assumption leads us to consider the Group Lasso as a candidate estimation
method. We show that this estimator enjoys nice sparsity oracle inequalities and variable
selection properties. The results hold under a certain restricted eigenvalue condition
and a coherence condition on the design matrix, which naturally extend recent work in
Bickel et al. [2007], Lounici [2008]. In particular, in the multi-task learning scenario, in
which the number of tasks can grow, we are able to remove completely the effect of the
number of predictor variables in the bounds. Finally, we show how our results can be
extended to more general noise distributions, of which we only require the variance to be
finite.



1 Introduction 



We study the problem of estimating multiple regression equations under sparsity assumptions 
on the underlying regression coefficients. More precisely, we consider multiple Gaussian re- 
gression models, 

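$$
\begin{aligned}
y_1 &= X_1 \beta_1^* + W_1 \\
y_2 &= X_2 \beta_2^* + W_2 \\
&\;\;\vdots \\
y_T &= X_T \beta_T^* + W_T
\end{aligned}
\qquad (1.1)
$$

where, for each $t = 1, \dots, T$, we let $X_t$ be a prescribed $n \times M$ design matrix,
$\beta_t^* \in \mathbb{R}^M$ the unknown vector of regression coefficients and $y_t$ an
$n$-dimensional vector of observations. We assume that $W_1, \dots, W_T$ are i.i.d. zero mean
random vectors.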

We are interested in estimation methods which work well even when the number of parameters
in each equation is much larger than the number of observations, that is, $M \gg n$. This
situation may arise in many practical applications in which the predictor variables are
inherently high dimensional, or it may be "costly" to observe response variables, due to
difficult experimental procedures; see, for example, Argyriou et al. [2008] for a discussion.



Examples in which this estimation problem is relevant range from multi-task learning
(Argyriou et al. [2008], Cavallanti et al. [2008], Maurer [2006], Obozinski et al. [2008]) and
conjoint analysis (see, for example, Evgeniou et al. [2007], Lenk et al. [1996] and references
therein) to longitudinal data analysis (Diggle [2002]) as well as the analysis of panel data
(Hsiao [2003], Wooldridge [2002]), among others. In particular, multi-task learning provides a
main motivation for our study. In that setting each regression equation corresponds to a
different learning task (the classification case can be treated similarly); in addition to the
requirement that $M \gg n$, we are also interested in the case that the number of tasks $T$ is
much larger than $n$. Following Argyriou et al. [2008] we assume that there are only a few
common important variables which are shared by the tasks. A general goal of this paper is to
study the implications of this assumption from a statistical learning viewpoint, in particular,
to quantify the advantage provided by the large number of tasks to learn both the underlying
vectors $\beta_1^*, \dots, \beta_T^*$ as well as to select common variables shared by the tasks.

Our study pertains to and draws substantial ideas from the recently developed area of
compressed sensing and sparse estimation (or sparse recovery), see Bickel et al. [2007],
Candès and Tao [2005], Donoho et al. [2006] and references therein. A central problem studied
therein is that of estimating the parameters of a (single) Gaussian regression model. Here, the
term "sparse" means that most of the components of the underlying $M$-dimensional regression
vector are equal to zero. A main motivation for sparse estimation comes from the observation
that in many practical applications $M$ is much larger than the number $n$ of observations but
the underlying model is (approximately) sparse, see Candès and Tao [2005], Donoho et al. [2006]
and references therein. Under this circumstance ordinary least squares will not work. A more
appropriate method for sparse estimation is the $\ell_1$-norm penalized least squares method,
which is commonly referred to as the Lasso method. In fact, it has been recently shown by
different authors, under different conditions on the design matrix, that the Lasso satisfies
sparsity oracle inequalities, see Bickel et al. [2007], Bunea et al. [2007a,b], van de Geer
[2008] and references therein. Closest to our study in this paper is Bickel et al. [2007],
which relies upon a Restricted Eigenvalue (RE) assumption. The results of these works make it
possible to estimate the parameter $\beta$ even in the so-called "$p$ much larger than $n$"
regime (in our notation, the number of predictor variables $p$ corresponds to $MT$).

In this paper, we assume that the vectors $\beta_1^*, \dots, \beta_T^*$ are not only sparse
but also have the same sparsity pattern. This means that the set of indices which correspond to
nonzero components of $\beta_t^*$ is the same for every $t = 1, \dots, T$. In other words, the
response variable associated with each equation in (1.1) depends only on a small subset (of
size $s \ll M$) of the corresponding predictor variables and the set of relevant predictors is
preserved across the different equations. This assumption, that we further refer to as the
structured sparsity assumption, is motivated by some recent work on multi-task learning
(Argyriou et al. [2008]). It naturally leads to an extension of the Lasso method, the so-called
group Lasso (Yuan and Lin [2006]), in which the error term is the average residual error across
the different equations and the penalty term is a mixed (2,1)-norm. The structured sparsity
assumption induces a relation between the responses and, as we shall see, can be used to
improve estimation.

The paper is organized as follows. In Section 2 we define the estimation method and comment
on previous related work. In Section 3 we study the oracle properties of this estimator
when the errors $W_t$ are Gaussian. Our main results concern upper bounds on the prediction
error and the distance between the estimator and the true regression vector $\beta^*$.
Specifically, Theorem 3.1 establishes that under the above structured sparsity assumption on
$\beta^*$, the prediction error is essentially of the order of $s/n$. In particular, in the
multi-task learning scenario, in which $T$ can grow, we are able to remove completely the
effect of the number of predictor variables in the bounds. Next, in Section 4, under a stronger
condition on the design matrices, we describe a simple modification of our method and show
that it selects the correct sparsity pattern with an overwhelming probability (Theorem 4.1).
We also find the rates of convergence of the estimators for mixed $(2,p)$-norms with
$1 \le p \le \infty$ (Theorem 4.2). The techniques of proofs build upon and extend those of
Bickel et al. [2007] and Lounici [2008]. Finally, in Section 5 we discuss how our results can
be extended to more general noise distributions, of which we only require the variance to be
finite.



2 Method and related work 

In this section we first introduce some notation and then describe the estimation method which
we analyze in the paper. As stated above, our goal is to estimate $T$ linear regression
functions identified by the parameters $\beta_1^*, \dots, \beta_T^* \in \mathbb{R}^M$. We may
write the model (1.1) in compact notation as

$$ y = X\beta^* + W \qquad (2.1) $$

where $y$ and $W$ are the $nT$-dimensional random vectors formed by stacking the vectors
$y_1, \dots, y_T$ and the vectors $W_1, \dots, W_T$, respectively. Likewise $\beta^*$ denotes
the vector obtained by stacking the regression parameter vectors $\beta_1^*, \dots, \beta_T^*$.
Unless otherwise specified, all vectors are meant to be column vectors. Thus, for every
$t \in \mathbb{N}_T$, we write $y_t = (y_{ti} : i \in \mathbb{N}_n)^\top$ and
$W_t = (W_{ti} : i \in \mathbb{N}_n)^\top$, where, hereafter, for every positive integer $k$,
we let $\mathbb{N}_k$ be the set of integers from 1 up to $k$. The $nT \times MT$ block
diagonal design matrix $X$ has its $t$-th block formed by the $n \times M$ matrix $X_t$. We let
$x_{t1}^\top, \dots, x_{tn}^\top$ be the row vectors forming $X_t$ and $(x_{ti})_j$ the $j$-th
component of the vector $x_{ti}$. Throughout the paper we assume that $x_{ti}$ are
deterministic.

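For every $\beta \in \mathbb{R}^{MT}$ we introduce
$\beta^j \equiv (\beta)^j = (\beta_{tj} : t \in \mathbb{N}_T)^\top$, that is, the vector formed
by the coefficients corresponding to the $j$-th variable. For every $1 \le p < \infty$ we define
the mixed $(2,p)$-norm of $\beta$ as

$$
\|\beta\|_{2,p}
= \left( \sum_{j=1}^{M} \Big( \sum_{t=1}^{T} \beta_{tj}^2 \Big)^{p/2} \right)^{1/p}
= \left( \sum_{j=1}^{M} \|\beta^j\|^p \right)^{1/p}
$$

and the $(2,\infty)$-norm of $\beta$ as

$$ \|\beta\|_{2,\infty} = \max_{1 \le j \le M} \|\beta^j\|, $$

where $\|\cdot\|$ is the standard Euclidean norm.

If $J \subseteq \mathbb{N}_M$ we let $\beta_J \in \mathbb{R}^{MT}$ be the vector formed by
stacking the vectors $(\beta^j I\{j \in J\} : j \in \mathbb{N}_M)$, where $I\{\cdot\}$ denotes
the indicator function. Finally we set $J(\beta) = \{j : \beta^j \ne 0,\ j \in \mathbb{N}_M\}$
and $M(\beta) = |J(\beta)|$ where $|J|$ denotes the cardinality of a set
$J \subseteq \{1, \dots, M\}$. The set $J(\beta)$ contains the indices of the relevant variables
shared by the vectors $\beta_1, \dots, \beta_T$ and the number $M(\beta)$ quantifies the level
of structured sparsity across those vectors.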
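
As an illustration of the notation just introduced, the following minimal sketch computes the mixed $(2,p)$-norms, the set $J(\beta)$ and the number $M(\beta)$. It assumes, purely for illustration, that $\beta$ is stored as a list of $T$ task vectors of length $M$; the function names are hypothetical and not part of the paper.

```python
import math

def column_norms(beta):
    """Euclidean norms ||beta^j||, j = 1..M, for beta given as T rows of length M."""
    T, M = len(beta), len(beta[0])
    return [math.sqrt(sum(beta[t][j] ** 2 for t in range(T))) for j in range(M)]

def mixed_norm(beta, p):
    """Mixed (2,p)-norm; pass p = math.inf for the (2,infinity)-norm."""
    norms = column_norms(beta)
    if math.isinf(p):
        return max(norms)
    return sum(v ** p for v in norms) ** (1.0 / p)

def support(beta):
    """J(beta): 1-based indices j whose coefficient vector beta^j is nonzero."""
    return [j + 1 for j, v in enumerate(column_norms(beta)) if v != 0.0]

# Toy example with T = 2 tasks and M = 3 variables (illustrative values only).
beta = [[3.0, 0.0, 1.0],
        [4.0, 0.0, 0.0]]
print(mixed_norm(beta, 1))         # sum of the column norms 5, 0 and 1
print(mixed_norm(beta, math.inf))  # largest column norm
print(support(beta), len(support(beta)))  # J(beta) and M(beta)
```

The second variable has a zero coefficient in both tasks, so it is excluded from $J(\beta)$ and $M(\beta) = 2$.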

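We have now accumulated sufficient information to introduce the estimation method.
We define the empirical residual error

$$
S(\beta) = \frac{1}{nT} \sum_{t=1}^{T} \sum_{i=1}^{n} \left( x_{ti}^\top \beta_t - y_{ti} \right)^2
= \frac{1}{nT} \| X\beta - y \|^2
$$

and, for every $\lambda > 0$, we let our estimator $\hat\beta$ be a solution of the optimization
problem (Argyriou et al. [2008])

$$ \min_{\beta} \; S(\beta) + 2\lambda \|\beta\|_{2,1}. \qquad (2.2) $$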
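
The paper does not prescribe an algorithm for solving (2.2). As a hedged illustration only, the sketch below uses plain proximal gradient descent (a standard technique swapped in here, not the authors' method): a gradient step on $S$ followed by block soft-thresholding of each column $\beta^j$. The data layout (`X[t]` as a list of rows of $X_t$), the function names, the crude step size based on the Frobenius norm and the toy data are all assumptions made for this example.

```python
import math

def residuals(X, y, beta):
    """r[t][i] = x_{ti}^T beta_t - y_{ti} for every task t and observation i."""
    return [[sum(a * b for a, b in zip(row, beta[t])) - y[t][i]
             for i, row in enumerate(X[t])]
            for t in range(len(X))]

def column_norm(beta, j):
    return math.sqrt(sum(bt[j] ** 2 for bt in beta))

def objective(X, y, beta, lam):
    """S(beta) + 2 * lam * ||beta||_{2,1}, the criterion in (2.2)."""
    T, n, M = len(X), len(X[0]), len(X[0][0])
    fit = sum(v * v for rt in residuals(X, y, beta) for v in rt) / (n * T)
    return fit + 2.0 * lam * sum(column_norm(beta, j) for j in range(M))

def group_lasso(X, y, lam, iters=300):
    T, n, M = len(X), len(X[0]), len(X[0][0])
    # Upper bound on the Lipschitz constant of grad S: the largest eigenvalue of
    # X_t^T X_t is at most the squared Frobenius norm of X_t.
    lip = 2.0 / (n * T) * max(sum(v * v for row in Xt for v in row) for Xt in X)
    step = 1.0 / lip
    beta = [[0.0] * M for _ in range(T)]
    for _ in range(iters):
        r = residuals(X, y, beta)
        # Gradient step on S: grad_t = (2 / (nT)) X_t^T (X_t beta_t - y_t).
        z = [[beta[t][j] - step * 2.0 / (n * T) * sum(X[t][i][j] * r[t][i] for i in range(n))
              for j in range(M)] for t in range(T)]
        # Proximal step: shrink each column z^j towards zero by 2 * lam * step.
        for j in range(M):
            norm = column_norm(z, j)
            shrink = max(0.0, 1.0 - 2.0 * lam * step / norm) if norm > 0.0 else 0.0
            for t in range(T):
                beta[t][j] = shrink * z[t][j]
    return beta

# Toy data: T = 2 tasks, n = 3 observations, M = 2 variables, noise-free responses
# generated from beta_1 = (2, 0) and beta_2 = (-1, 0), so only variable 1 is relevant.
X = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
     [[1.0, 1.0], [1.0, -1.0], [0.0, 1.0]]]
y = [[2.0, 0.0, 2.0], [-1.0, -1.0, 0.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
beta_small = group_lasso(X, y, lam=0.01)
beta_large = group_lasso(X, y, lam=100.0)
print(objective(X, y, beta_small, 0.01), objective(X, y, zero, 0.01))
print(beta_large)
```

With a very large $\lambda$ every column is thresholded to zero, while a small $\lambda$ yields a fit whose objective value is below that of the zero vector.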

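In order to study the statistical properties of this estimator, it is useful to derive the
optimality condition for a solution of the problem (2.2). Since the objective function in (2.2)
is convex, $\hat\beta$ is a solution of (2.2) if and only if $0$ (the $MT$-dimensional zero
vector) belongs to the subdifferential of the objective function. In turn, this condition is
equivalent to the requirement that

$$ -\nabla S(\hat\beta) \in 2\lambda \, \partial \Big( \sum_{j=1}^{M} \|\hat\beta^j\| \Big), $$

where $\partial$ denotes the subdifferential (see, for example, Borwein and Lewis [2006] for
more information on convex analysis). Note that

$$
\partial \Big( \sum_{j=1}^{M} \|\beta^j\| \Big)
= \Big\{ \theta \in \mathbb{R}^{MT} :
\theta^j = \frac{\beta^j}{\|\beta^j\|} \ \text{if } \beta^j \ne 0, \quad
\|\theta^j\| \le 1 \ \text{if } \beta^j = 0, \quad j \in \mathbb{N}_M \Big\}.
$$

Thus, $\hat\beta$ is a solution of (2.2) if and only if

$$
\frac{1}{nT} \big( X^\top (y - X\hat\beta) \big)^j = \lambda \frac{\hat\beta^j}{\|\hat\beta^j\|},
\quad \text{if } \hat\beta^j \ne 0, \qquad (2.3)
$$

$$
\frac{1}{nT} \big\| \big( X^\top (y - X\hat\beta) \big)^j \big\| \le \lambda,
\quad \text{if } \hat\beta^j = 0. \qquad (2.4)
$$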
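
To make the optimality conditions (2.3) and (2.4) concrete, consider the special case (an assumption made only for this illustration) in which $X_t^\top X_t / n$ is the identity for every $t$. Writing $z_t = X_t^\top y_t / n$, the criterion in (2.2) equals $\frac{1}{T}\sum_j \|\beta^j - z^j\|^2 + 2\lambda \sum_j \|\beta^j\|$ up to a constant, so its minimizer is the block soft-thresholding $\hat\beta^j = z^j (1 - \lambda T / \|z^j\|)_+$, and $\frac{1}{nT}(X^\top(y - X\hat\beta))^j = (z^j - \hat\beta^j)/T$. The sketch below, with hypothetical function names and toy values, computes this solution and checks (2.3) and (2.4) numerically.

```python
import math

def block_soft_threshold(z, lam):
    """Solution of (2.2) when X_t^T X_t / n = I for all t; z[t][j] = (X_t^T y_t / n)_j."""
    T, M = len(z), len(z[0])
    beta = [[0.0] * M for _ in range(T)]
    for j in range(M):
        norm = math.sqrt(sum(z[t][j] ** 2 for t in range(T)))
        if norm > lam * T:
            shrink = 1.0 - lam * T / norm
            for t in range(T):
                beta[t][j] = shrink * z[t][j]
    return beta

def kkt_violation(z, beta, lam):
    """Largest violation of (2.3)-(2.4) over j, under the same orthonormality assumption."""
    T, M = len(z), len(z[0])
    worst = 0.0
    for j in range(M):
        # g = (1 / (nT)) (X^T (y - X beta))^j, which equals (z^j - beta^j) / T here.
        g = [(z[t][j] - beta[t][j]) / T for t in range(T)]
        bnorm = math.sqrt(sum(beta[t][j] ** 2 for t in range(T)))
        if bnorm > 0.0:   # condition (2.3): g must equal lam * beta^j / ||beta^j||
            gap = math.sqrt(sum((g[t] - lam * beta[t][j] / bnorm) ** 2 for t in range(T)))
        else:             # condition (2.4): ||g|| must not exceed lam
            gap = max(0.0, math.sqrt(sum(v * v for v in g)) - lam)
        worst = max(worst, gap)
    return worst

# Toy values: T = 2, M = 2; the columns of z have norms 5 and 0.5, threshold lam * T = 1.
z = [[3.0, 0.3],
     [4.0, -0.4]]
lam = 0.5
beta_hat = block_soft_threshold(z, lam)
print(beta_hat)
print(kkt_violation(z, beta_hat, lam))
```

The first column is shrunk by the factor $1 - 1/5$ and the second column, whose norm is below $\lambda T$, is set to zero.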



Finally, let us comment on previous related work. Our estimator is a special case of the
group Lasso estimator (Yuan and Lin [2006]). Several papers analyzing statistical properties of
the group Lasso appeared quite recently: Bach [2008], Chesneau and Hebiri [2007], Huang et al.
[2008], Koltchinskii and Yuan [2008], Meier et al. [2006, 2008], Nardi and Rinaldo [2008],
Ravikumar et al. [2007]. Most of them are focused on the group Lasso for additive models
(Huang et al. [2008], Koltchinskii and Yuan [2008], Meier et al. [2008], Ravikumar et al.
[2007]) or generalized linear models (Meier et al. [2006]). A special choice of groups is
studied in Chesneau and Hebiri [2007]. Discussion of the group Lasso in a relatively general
setting is given by Bach [2008] and Nardi and Rinaldo [2008]. Bach [2008] assumes that the
predictors $x_{ti}$ are random with a positive definite covariance matrix and proves results on
consistent selection of the sparsity pattern $J(\beta^*)$ when the dimension of the model
($p = MT$ in our case) is fixed and $n \to \infty$. Nardi and Rinaldo [2008] consider a setting
that covers ours and address the issue of sparsity oracle inequalities in the spirit of
Bickel et al. [2007]. However, their bounds are too coarse (see comments in Section 3 below).
Obozinski et al. [2008] replace in (2.2) the $(2,1)$-norms by $(q,1)$-norms with $q > 1$ and
show that the resulting estimator achieves consistent selection of the sparsity pattern under
the assumption that all the rows of matrices $X_t$ are independent Gaussian random vectors with
the same covariance matrix.

This literature does not demonstrate theoretical advantages of the group Lasso as compared 
to the usual Lasso. One of the aims of this paper is to show that such advantages do exist in 
the multi-task learning setup. In particular, our Theorem 3.1 implies that the prediction bound
for the group Lasso estimator that we use here is by at least a factor of $T$ better than for
the standard Lasso under the same assumptions. Furthermore, we demonstrate that as the number
of tasks $T$ increases the dependence of the bound on $M$ disappears, provided that $M$ grows at
a rate slower than $\exp(\sqrt{T})$.



3 Sparsity oracle inequality 

Let $1 \le s \le M$ be an integer that gives an upper bound on the structured sparsity
$M(\beta^*)$ of the true regression vector $\beta^*$. We make the following assumption.

Assumption 3.1. There exists a positive number $\kappa = \kappa(s)$ such that

$$
\min \left\{ \frac{\|X\Delta\|}{\sqrt{n}\,\|\Delta_J\|} :
|J| \le s,\ \Delta \in \mathbb{R}^{MT} \setminus \{0\},\
\|\Delta_{J^c}\|_{2,1} \le 3 \|\Delta_J\|_{2,1} \right\} \ge \kappa,
$$

where $J^c$ denotes the complement of the set of indices $J$.

To emphasize the dependency of Assumption 3.1 on $s$, we will sometimes refer to it as
Assumption RE($s$). This is a natural extension to our setting of the Restricted Eigenvalue
assumption for the usual Lasso and Dantzig selector from Bickel et al. [2007]. The $\ell_1$
norms are now replaced by the mixed $(2,1)$-norms. Note, however, that the analogy is not
complete. In fact, the sample size $n$ in the usual Lasso setting corresponds to $nT$ in our
case, whereas in Assumption 3.1 we consider $\sqrt{\Delta^\top X^\top X \Delta / n}$ and not
$\sqrt{\Delta^\top X^\top X \Delta / (nT)}$. This is done in order to have a correct
normalization of $\kappa$ without compulsory dependence on $T$ (if we use the term
$\sqrt{\Delta^\top X^\top X \Delta / (nT)}$ in Assumption 3.1 then $\kappa \sim T^{-1/2}$ even
in the case of the identity matrix $X^\top X / n$).

Several simple sufficient conditions for Assumption 3.1 with $T = 1$ are given in
Bickel et al. [2007]. Similar sufficient conditions can be stated in our more general setting.
For example, it is enough to suppose that each of the matrices $X_t^\top X_t / n$ is positive
definite or satisfies a Restricted Isometry condition as in Candès and Tao [2005] or the
coherence condition (cf. Lemma 4.1 below).

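Lemma 3.1. Consider the model (1.1) for $M \ge 2$ and $T, n \ge 1$. Assume that the random
vectors $W_1, \dots, W_T$ are i.i.d. Gaussian with zero mean and covariance matrix
$\sigma^2 I_{n \times n}$, all diagonal elements of the matrix $X^\top X / n$ are equal to 1 and
$M(\beta^*) \le s$. Let

$$ \lambda = \frac{2\sigma}{\sqrt{nT}} \left( 1 + \frac{A \log M}{\sqrt{T}} \right)^{1/2}, $$

where $A > 8$ and let $q = \min(8 \log M, A\sqrt{T}/8)$. Then with probability at least
$1 - M^{1-q}$, for any solution $\hat\beta$ of problem (2.2) and all
$\beta \in \mathbb{R}^{MT}$ we have

$$
\frac{1}{nT} \|X(\hat\beta - \beta^*)\|^2 + \lambda \|\hat\beta - \beta\|_{2,1}
\le \frac{1}{nT} \|X(\beta - \beta^*)\|^2 + 4\lambda \sum_{j \in J(\beta)} \|\hat\beta^j - \beta^j\|,
\qquad (3.1)
$$

$$
\frac{1}{nT} \max_{1 \le j \le M} \big\| \big( X^\top X (\beta^* - \hat\beta) \big)^j \big\|
\le \frac{3}{2} \lambda, \qquad (3.2)
$$

$$
M(\hat\beta) \le \frac{4 \phi_{\max}}{\lambda^2 n T^2} \|X(\hat\beta - \beta^*)\|^2, \qquad (3.3)
$$

where $\phi_{\max}$ is the maximum eigenvalue of the matrix $X^\top X / n$.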
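Proof. For all $\beta \in \mathbb{R}^{MT}$, we have

$$
\frac{1}{nT} \|X\hat\beta - y\|^2 + 2\lambda \sum_{j=1}^{M} \|\hat\beta^j\|
\le \frac{1}{nT} \|X\beta - y\|^2 + 2\lambda \sum_{j=1}^{M} \|\beta^j\|,
$$

which, using $y = X\beta^* + W$, is equivalent to

$$
\frac{1}{nT} \|X(\hat\beta - \beta^*)\|^2
\le \frac{1}{nT} \|X(\beta - \beta^*)\|^2
+ \frac{2}{nT} W^\top X (\hat\beta - \beta)
+ 2\lambda \sum_{j=1}^{M} \big( \|\beta^j\| - \|\hat\beta^j\| \big). \qquad (3.4)
$$

By Hölder's inequality, we have that

$$ W^\top X (\hat\beta - \beta) \le \|X^\top W\|_{2,\infty} \, \|\hat\beta - \beta\|_{2,1} $$

where

$$
\|X^\top W\|_{2,\infty}
= \max_{1 \le j \le M} \left( \sum_{t=1}^{T} \Big( \sum_{i=1}^{n} (x_{ti})_j W_{ti} \Big)^2 \right)^{1/2}.
$$

Consider the random event

$$ \mathcal{A} = \left\{ \frac{1}{nT} \|X^\top W\|_{2,\infty} \le \frac{\lambda}{2} \right\}. $$

Since we assume all diagonal elements of the matrix $X^\top X / n$ to be equal to 1, the random
variables

$$ \xi_{tj} = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} (x_{ti})_j W_{ti}, $$

$t = 1, \dots, T$, are i.i.d. standard Gaussian. Using this fact we can write, for any
$j = 1, \dots, M$,

$$
\Pr\left( \sum_{t=1}^{T} \Big( \sum_{i=1}^{n} (x_{ti})_j W_{ti} \Big)^2 \ge \frac{\lambda^2 (nT)^2}{4} \right)
= \Pr\left( \chi_T^2 \ge \frac{\lambda^2 n T^2}{4\sigma^2} \right)
= \Pr\left( \chi_T^2 \ge T + A\sqrt{T} \log M \right),
$$

where $\chi_T^2$ is a chi-square random variable with $T$ degrees of freedom. We now apply
Lemma A.1, the union bound and the fact that $A > 8$ to get

$$
\Pr(\mathcal{A}^c) \le M \exp\left( -\frac{A \log M}{8} \min\big( \sqrt{T}, A \log M \big) \right)
\le M^{1-q}.
$$

It follows from (3.4) that, on the event $\mathcal{A}$,

$$
\frac{1}{nT} \|X(\hat\beta - \beta^*)\|^2 + \lambda \sum_{j=1}^{M} \|\hat\beta^j - \beta^j\|
\le \frac{1}{nT} \|X(\beta - \beta^*)\|^2
+ 2\lambda \sum_{j=1}^{M} \big( \|\hat\beta^j - \beta^j\| + \|\beta^j\| - \|\hat\beta^j\| \big)
\le \frac{1}{nT} \|X(\beta - \beta^*)\|^2 + 4\lambda \sum_{j \in J(\beta)} \|\hat\beta^j - \beta^j\|,
$$

which coincides with (3.1). To prove (3.2), we use the inequality

$$
\frac{1}{nT} \max_{1 \le j \le M} \big\| \big( X^\top (y - X\hat\beta) \big)^j \big\| \le \lambda, \qquad (3.5)
$$

which follows from (2.3) and (2.4). Then,

$$
\frac{1}{nT} \big\| \big( X^\top X (\hat\beta - \beta^*) \big)^j \big\|
\le \frac{1}{nT} \big\| \big( X^\top (X\hat\beta - y) \big)^j \big\|
+ \frac{1}{nT} \big\| (X^\top W)^j \big\|,
$$

where we have used $y = X\beta^* + W$ and the triangle inequality. The result then follows by
combining the last inequality with inequality (3.5) and using the definition of the event
$\mathcal{A}$.

Finally, we prove (3.3). First, observe that, on the event $\mathcal{A}$,

$$
\frac{1}{nT} \big\| \big( X^\top X (\hat\beta - \beta^*) \big)^j \big\| \ge \frac{\lambda}{2},
\quad \text{if } \hat\beta^j \ne 0.
$$

This fact follows from (2.3), (2.1) and the definition of the event $\mathcal{A}$. The following
chain yields the result:

$$
M(\hat\beta)
\le \frac{4}{\lambda^2 (nT)^2} \sum_{j \in J(\hat\beta)} \big\| \big( X^\top X (\hat\beta - \beta^*) \big)^j \big\|^2
\le \frac{4}{\lambda^2 (nT)^2} \big\| X^\top X (\hat\beta - \beta^*) \big\|^2
\le \frac{4 \phi_{\max}}{\lambda^2 n T^2} \|X(\hat\beta - \beta^*)\|^2. \qquad \blacksquare
$$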

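We are now ready to state the main result of this section.

Theorem 3.1. Consider the model (1.1) for $M \ge 2$ and $T, n \ge 1$. Assume that the random
vectors $W_1, \dots, W_T$ are i.i.d. Gaussian with zero mean and covariance matrix
$\sigma^2 I_{n \times n}$, all diagonal elements of the matrix $X^\top X / n$ are equal to 1 and
$M(\beta^*) \le s$. Furthermore let Assumption 3.1 hold with $\kappa = \kappa(s)$ and let
$\phi_{\max}$ be the largest eigenvalue of the matrix $X^\top X / n$. Let

$$ \lambda = \frac{2\sigma}{\sqrt{nT}} \left( 1 + \frac{A \log M}{\sqrt{T}} \right)^{1/2}, $$

where $A > 8$ and let $q = \min(8 \log M, A\sqrt{T}/8)$. Then with probability at least
$1 - M^{1-q}$, for any solution $\hat\beta$ of problem (2.2) we have

$$
\frac{1}{nT} \|X(\hat\beta - \beta^*)\|^2
\le \frac{64\sigma^2}{\kappa^2} \frac{s}{n} \left( 1 + \frac{A \log M}{\sqrt{T}} \right), \qquad (3.6)
$$

$$
\frac{1}{\sqrt{T}} \|\hat\beta - \beta^*\|_{2,1}
\le \frac{32\sigma}{\kappa^2} \frac{s}{\sqrt{n}} \sqrt{1 + \frac{A \log M}{\sqrt{T}}}, \qquad (3.7)
$$

$$ M(\hat\beta) \le \frac{64 \phi_{\max}}{\kappa^2} s. \qquad (3.8) $$

If, in addition, Assumption RE($2s$) holds, then with the same probability for any solution
$\hat\beta$ of problem (2.2) we have

$$
\frac{1}{\sqrt{T}} \|\hat\beta - \beta^*\|
\le \frac{8\sqrt{10}\,\sigma}{\kappa^2(2s)} \sqrt{\frac{s}{n}} \sqrt{1 + \frac{A \log M}{\sqrt{T}}}. \qquad (3.9)
$$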
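
For orientation, the quantities appearing in Lemma 3.1 and Theorem 3.1 can be evaluated numerically. The sketch below is only a calculator for the stated formulas; the function names, the default $A = 9$ and the example values of $\sigma$, $n$, $T$, $M$, $s$, $\kappa$ and $\phi_{\max}$ are assumptions chosen for illustration.

```python
import math

def tuning_parameter(sigma, n, T, M, A=9.0):
    """lambda = (2 sigma / sqrt(nT)) * sqrt(1 + A log M / sqrt(T)), with A > 8."""
    if A <= 8.0:
        raise ValueError("Lemma 3.1 requires A > 8")
    return 2.0 * sigma / math.sqrt(n * T) * math.sqrt(1.0 + A * math.log(M) / math.sqrt(T))

def theorem_3_1_bounds(sigma, n, T, M, s, kappa, phi_max, A=9.0):
    """Right-hand sides of (3.6), (3.7), (3.8) and the probability level 1 - M^(1-q)."""
    lam = tuning_parameter(sigma, n, T, M, A)
    factor = 1.0 + A * math.log(M) / math.sqrt(T)
    q = min(8.0 * math.log(M), A * math.sqrt(T) / 8.0)
    return {
        "lambda": lam,
        "probability": 1.0 - M ** (1.0 - q),
        "prediction": 64.0 * sigma ** 2 * s * factor / (kappa ** 2 * n),                  # (3.6)
        "l21_error": 32.0 * sigma * s * math.sqrt(factor) / (kappa ** 2 * math.sqrt(n)),  # (3.7)
        "support_size": 64.0 * phi_max * s / kappa ** 2,                                  # (3.8)
    }

# Hypothetical setting: many tasks (T = 400) relative to n = 50, and M = 1000 variables.
b = theorem_3_1_bounds(sigma=1.0, n=50, T=400, M=1000, s=5, kappa=0.5, phi_max=2.0)
for name, value in b.items():
    print(name, value)
```

Increasing `T` in this calculator drives the factor $1 + A \log M / \sqrt{T}$ towards 1, which is the mechanism by which the dependence on $M$ disappears.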



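Proof. We act similarly to the proof of Theorem 6.2 in Bickel et al. [2007]. Let
$J = J(\beta^*) = \{j : (\beta^*)^j \ne 0\}$. By inequality (3.1) with $\beta = \beta^*$ we
have, on the event $\mathcal{A}$, that

$$
\frac{1}{nT} \|X(\hat\beta - \beta^*)\|^2
\le 4\lambda \sum_{j \in J} \|\hat\beta^j - \beta^{*j}\|
\le 4\lambda \sqrt{s} \, \|(\hat\beta - \beta^*)_J\|. \qquad (3.10)
$$

Moreover by the same inequality, on the event $\mathcal{A}$, we have
$\sum_{j=1}^{M} \|\hat\beta^j - \beta^{*j}\| \le 4 \sum_{j \in J} \|\hat\beta^j - \beta^{*j}\|$,
which implies that
$\sum_{j \in J^c} \|\hat\beta^j - \beta^{*j}\| \le 3 \sum_{j \in J} \|\hat\beta^j - \beta^{*j}\|$.
Thus, by Assumption 3.1,

$$ \|(\hat\beta - \beta^*)_J\| \le \frac{\|X(\hat\beta - \beta^*)\|}{\kappa \sqrt{n}}. \qquad (3.11) $$

Now, (3.6) follows from (3.10) and (3.11). Inequality (3.7) follows again by noting that

$$
\sum_{j=1}^{M} \|\hat\beta^j - \beta^{*j}\|
\le 4 \sum_{j \in J} \|\hat\beta^j - \beta^{*j}\|
\le 4\sqrt{s} \, \|(\hat\beta - \beta^*)_J\|
$$

and then using (3.6). Inequality (3.8) follows from (3.3) and (3.6).

Finally, we prove (3.9). Let $\Delta = \hat\beta - \beta^*$ and let $J'$ be the set of indices
in $J^c$ corresponding to the $s$ largest norms $\|\Delta^j\|$. Consider the set
$J_{2s} = J \cup J'$. Note that $|J_{2s}| \le 2s$. Let $\|\Delta^{(k)}_{J^c}\|$ denote the
$k$-th largest norm in the set $\{\|\Delta^j\| : j \in J^c\}$. Then, clearly,

$$
\|\Delta^{(k)}_{J^c}\| \le \sum_{j \in J^c} \|\Delta^j\| / k = \|\Delta_{J^c}\|_{2,1} / k.
$$

This and the fact that $\|\Delta_{J^c}\|_{2,1} \le 3 \|\Delta_J\|_{2,1}$ on the event
$\mathcal{A}$ implies

$$
\sum_{j \in J_{2s}^c} \|\Delta^j\|^2
\le \sum_{k=s+1}^{\infty} \frac{\|\Delta_{J^c}\|_{2,1}^2}{k^2}
\le \frac{\|\Delta_{J^c}\|_{2,1}^2}{s}
\le \frac{9 \|\Delta_J\|_{2,1}^2}{s}
\le 9 \sum_{j \in J} \|\Delta^j\|^2
\le 9 \sum_{j \in J_{2s}} \|\Delta^j\|^2.
$$

Therefore, on $\mathcal{A}$ we have

$$ \|\Delta\| \le \sqrt{10} \, \|\Delta_{J_{2s}}\| \qquad (3.12) $$

and also from (3.10):

$$ \frac{1}{nT} \|X\Delta\|^2 \le 4\lambda \sqrt{s} \, \|\Delta_{J_{2s}}\|. \qquad (3.13) $$

In addition, $\|\Delta_{J^c}\|_{2,1} \le 3 \|\Delta_J\|_{2,1}$ easily implies that

$$ \|\Delta_{J_{2s}^c}\|_{2,1} \le 3 \|\Delta_{J_{2s}}\|_{2,1}. $$

Combining these facts and (3.13) with Assumption RE($2s$) we find that on the event
$\mathcal{A}$ the following holds:

$$ \|\Delta_{J_{2s}}\| \le \frac{4\lambda \sqrt{s}\, T}{\kappa^2(2s)}. $$

This inequality and (3.12) yield (3.9). ■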

Theorem 3.1 is valid for any fixed $n, M, T$; the approach is non-asymptotic. Some relations
between these parameters are relevant in the particular applications and various asymptotics can
be derived as corollaries. For example, in multi-task learning it is natural to assume that
$T \ge n$, and the motivation for our approach is the strongest if also $M \gg n$. The bounds of
Theorem 3.1 are meaningful if the sparsity index $s$ is small as compared to the sample size $n$
and the logarithm of the dimension $\log M$ is not too large as compared to $\sqrt{T}$.

Note also that the values $T$ and $\sqrt{T}$ in the denominators of the right-hand sides of
(3.6), (3.7), and (3.9) appear quite naturally. For instance, the norm
$\|\hat\beta - \beta^*\|_{2,1}$ in (3.7) is a sum of $M$ terms each of which is a Euclidean norm
of a vector in $\mathbb{R}^T$, and thus it is of the order $\sqrt{T}$ if all the components are
equal. Therefore, (3.7) can be interpreted as a correctly normalized "error per coefficient"
bound.

Several important conclusions can be drawn from Theorem 3.1.

1. The dependence on the dimension $M$ is negligible for large $T$. Indeed, the bounds of
Theorem 3.1 become independent of $M$ if we choose the number of tasks $T$ larger than
$\log^2 M$. A striking fact is that no relation between the sample size $n$ and the dimension
$M$ is required. This is quite in contrast to the previous results on sparse recovery where
the assumption $\log M = o(n)$ was considered as a sine qua non constraint. For example,
Theorem 3.1 gives meaningful bounds if $M = \exp(n^{\gamma})$ for arbitrarily large
$\gamma > 0$, provided that $T > n^{2\gamma}$. This is due to the structured sparsity assumption
that we naturally exploit in the multi-task scenario.

2. Our estimator is better than the standard Lasso in the multi-task setup. Theorem 3.1
witnesses that our group Lasso estimator admits substantially better error bounds than the
usual Lasso. Let us explain this considering the example of the prediction error bound (3.6).
Indeed, for the same multi-task setup, we can apply a usual Lasso estimator $\hat\beta^L$,
that is a solution of the following optimization problem

$$ \min_{\beta} \; S(\beta) + 2\lambda' \sum_{t=1}^{T} \sum_{j=1}^{M} |\beta_{tj}|. $$

Assume, for instance, that we are in the most favorable situation where $M < n$, each of
the matrices $\frac{1}{n} X_t^\top X_t$ is positive definite and has minimal eigenvalue greater
than $\kappa^2$ (this, of course, implies Assumption 3.1). We can then apply inequality (7.8)
from Bickel et al. [2007] with

$$ \lambda' = A\sigma \sqrt{\frac{\log(MT)}{nT}}, $$

where $A > 2\sqrt{2}$, to obtain that, with probability at least $1 - (MT)^{1 - A^2/8}$, it
holds

$$
\frac{1}{nT} \|X(\hat\beta^L - \beta^*)\|^2
\le \frac{16 A^2}{\kappa^2} \sigma^2 \frac{sT \log(MT)}{n}.
$$

Indeed, when applying (7.8) of Bickel et al. [2007] we account for the fact that the parameters
$n, M, s$ therein correspond to $nT, MT, sT$ in our setup, and the minimal eigenvalue
of the matrix $\frac{1}{nT} X^\top X$ is greater than $\kappa^2 / T$. Comparison with (3.6)
leads to the conclusion that the prediction bound for our estimator is by at least a factor of
$T$ better than for the standard Lasso under the same assumptions. Let us emphasize that the
improvement is due to the property that $\beta^*$ is structured sparse. The second inherent
property of our setting, that is, the fact that the matrix $X^\top X$ is block-diagonal, can be
characterized as important but not indispensable. We discuss this in the next remark.

3. Theorem 3.1 applies to the general group Lasso setting. Indeed, the proofs in this section
do not use the fact that the matrix $X^\top X$ is block-diagonal. The only restriction
on $X^\top X$ is given in Assumption 3.1. For example, Assumption 3.1 is obviously satisfied
if $X^\top X / (nT)$ (the correctly normalized Gram matrix of the regression model (2.1))
has a positive minimal eigenvalue. However, the price for having this property (or
Assumption 3.1 in general), as well as the resulting error bounds, can be different for the
block-diagonal (multi-task) setting and the full matrix $X$ setting.



Finally, we note that Nardi and Rinaldo [2008] follow the scheme of the proof of
Bickel et al. [2007] to derive oracle inequalities that are similar in spirit to ours but
coarser. Their results do not explain the advantages discussed in points 1-3 above. Indeed, the
tuning parameter $\lambda$ chosen in Nardi and Rinaldo [2008], pp. 614-615, is larger than our
$\lambda$ by at least a factor of $\sqrt{T}$. As a consequence, the corresponding bounds in the
oracle inequalities of Nardi and Rinaldo [2008] are larger than ours by positive powers of $T$.



4 Coordinate-wise estimation and selection of sparsity pattern

In this section, we show how from any solution of the problem (2.2) we can reliably estimate
the correct sparsity pattern with high probability.

We first introduce some more notation. We define the Gram matrix of the design
$\Psi = \frac{1}{n} X^\top X$. Note that $\Psi$ is an $MT \times MT$ block-diagonal matrix with
$T$ blocks of dimension $M \times M$ each. We denote these blocks by
$\Psi_t = \frac{1}{n} X_t^\top X_t \equiv (\Psi_{tj,tk})_{j,k=1,\dots,M}$.

In this section we assume that the following condition holds true.

Assumption 4.1. The elements $\Psi_{tj,tk}$ of the Gram matrix $\Psi$ satisfy

$$ \Psi_{tj,tj} = 1, \quad \forall\, 1 \le j \le M,\ 1 \le t \le T, $$

and

$$ \max_{1 \le t \le T,\ j \ne k} |\Psi_{tj,tk}| \le \frac{1}{7\alpha s}, $$

for some integer $s \ge 1$ and some constant $\alpha > 1$.

Note that the above assumption on $\Psi$ implies Assumption 3.1, as we prove in the following
lemma.
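Lemma 4.1. Let Assumption 4.1 be satisfied. Then Assumption 3.1 is satisfied with
$\kappa = \sqrt{1 - \frac{1}{\alpha}}$.

Proof. For any subset $J$ of $\{1, \dots, M\}$ such that $|J| \le s$ and any
$\Delta \in \mathbb{R}^{MT}$ such that $\|\Delta_{J^c}\|_{2,1} \le 3 \|\Delta_J\|_{2,1}$, we
have

$$
\frac{\Delta_J^\top \Psi \Delta_J}{\|\Delta_J\|^2}
= 1 + \frac{\Delta_J^\top (\Psi - I_{MT \times MT}) \Delta_J}{\|\Delta_J\|^2}
\ge 1 - \frac{1}{7\alpha s} \frac{\sum_{t=1}^{T} \big( \sum_{j \in J} |\Delta_{tj}| \big)^2}{\|\Delta_J\|^2}
\ge 1 - \frac{1}{7\alpha},
$$

where we have used Assumption 4.1 and the Cauchy-Schwarz inequality. Next, using consecutively
Assumption 4.1, the Cauchy-Schwarz inequality and the inequality
$\|\Delta_{J^c}\|_{2,1} \le 3 \|\Delta_J\|_{2,1}$ we obtain

$$
\frac{|\Delta_{J^c}^\top \Psi \Delta_J|}{\|\Delta_J\|^2}
\le \frac{1}{7\alpha s} \frac{\sum_{t=1}^{T} \sum_{j \in J} \sum_{k \in J^c} |\Delta_{tj}| |\Delta_{tk}|}{\|\Delta_J\|^2}
\le \frac{1}{7\alpha s} \frac{\sum_{j \in J,\, k \in J^c} \|\Delta^j\| \|\Delta^k\|}{\|\Delta_J\|^2}
\le \frac{3}{7\alpha s} \frac{\|\Delta_J\|_{2,1}^2}{\|\Delta_J\|^2}
\le \frac{3}{7\alpha}.
$$

Combining these inequalities we find

$$
\frac{\Delta^\top \Psi \Delta}{\|\Delta_J\|^2}
\ge \frac{\Delta_J^\top \Psi \Delta_J}{\|\Delta_J\|^2}
+ \frac{2 \Delta_{J^c}^\top \Psi \Delta_J}{\|\Delta_J\|^2}
\ge 1 - \frac{1}{7\alpha} - \frac{6}{7\alpha}
= 1 - \frac{1}{\alpha}. \qquad \blacksquare
$$

Note also that, by an argument as in Lounici [2008], it is not hard to show that under
Assumption 4.1 the vector $\beta^*$ satisfying (2.1) is unique.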
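
The coherence condition of Assumption 4.1 and the resulting constant of Lemma 4.1 are easy to check for a given design. The following sketch assumes each $X_t$ is stored as a list of rows and uses hypothetical function names; it returns the largest $\alpha$ compatible with the observed coherence and then $\kappa = \sqrt{1 - 1/\alpha}$.

```python
import math

def gram(Xt):
    """Psi_t = X_t^T X_t / n for a design matrix given as a list of n rows of length M."""
    n, M = len(Xt), len(Xt[0])
    return [[sum(Xt[i][j] * Xt[i][k] for i in range(n)) / n for k in range(M)]
            for j in range(M)]

def coherence_alpha(X, s, tol=1e-9):
    """Largest alpha with max off-diagonal |Psi_{tj,tk}| <= 1 / (7 alpha s).

    Returns None when Assumption 4.1 fails: a diagonal entry differs from 1,
    or the largest admissible alpha is not greater than 1.
    """
    mu = 0.0
    for Xt in X:
        G = gram(Xt)
        for j in range(len(G)):
            if abs(G[j][j] - 1.0) > tol:
                return None
            for k in range(len(G)):
                if k != j:
                    mu = max(mu, abs(G[j][k]))
    if mu == 0.0:
        return math.inf
    alpha = 1.0 / (7.0 * s * mu)
    return alpha if alpha > 1.0 else None

def kappa_from_alpha(alpha):
    """kappa = sqrt(1 - 1 / alpha), the constant given by Lemma 4.1."""
    return math.sqrt(1.0 - 1.0 / alpha)

# Toy design with T = 1, n = 16, M = 2 and entries +-1, so the diagonal of Psi_t is 1
# and the off-diagonal entry is (9 - 7) / 16 = 0.125.
X = [[[1.0, 1.0 if i < 9 else -1.0] for i in range(16)]]
alpha = coherence_alpha(X, s=1)
print(alpha, kappa_from_alpha(alpha))
print(coherence_alpha(X, s=2))  # coherence too large for s = 2
```

The example shows how quickly the condition becomes restrictive: the same design satisfies Assumption 4.1 for $s = 1$ but not for $s = 2$.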

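Theorem 3.1 provides bounds for compound measures of risk, that is, depending simultaneously
on all the vectors $\beta^j$. An important question is to evaluate the performance of
estimators for each of the components $\beta^j$ separately. The next theorem provides a bound
of this type and, as a consequence, a result on the selection of sparsity pattern.

Theorem 4.1. Consider the model (1.1) for $M \ge 2$ and $T, n \ge 1$. Let the assumptions of
Lemma 3.1 be satisfied and let Assumption 4.1 hold with the same $s$. Set

$$ c = \left( 3 + \frac{32}{7(\alpha - 1)} \right) \sigma. $$

Let $\lambda$, $A$ and $W_1, \dots, W_T$ be as in Lemma 3.1. Then with probability at least
$1 - M^{1-q}$, where $q = \min(8 \log M, A\sqrt{T}/8)$, for any solution $\hat\beta$ of problem
(2.2) we have

$$
\frac{1}{\sqrt{T}} \|\hat\beta - \beta^*\|_{2,\infty}
\le \frac{c}{\sqrt{n}} \sqrt{1 + \frac{A \log M}{\sqrt{T}}}. \qquad (4.1)
$$

If, in addition,

$$
\min_{j \in J(\beta^*)} \frac{1}{\sqrt{T}} \|(\beta^*)^j\|
> \frac{2c}{\sqrt{n}} \sqrt{1 + \frac{A \log M}{\sqrt{T}}}, \qquad (4.2)
$$

then with the same probability for any solution $\hat\beta$ of problem (2.2) the set of indices

$$
\hat{J} = \left\{ j : \frac{1}{\sqrt{T}} \|\hat\beta^j\|
> \frac{c}{\sqrt{n}} \sqrt{1 + \frac{A \log M}{\sqrt{T}}} \right\} \qquad (4.3)
$$

estimates correctly the sparsity pattern $J(\beta^*)$, that is,

$$ \hat{J} = J(\beta^*). $$

Proof. Set $\Delta = \hat\beta - \beta^*$. Using Assumption 4.1 we obtain

$$
\begin{aligned}
\|\Delta\|_{2,\infty}
&\le \|\Psi\Delta\|_{2,\infty} + \|(\Psi - I_{MT \times MT})\Delta\|_{2,\infty} \\
&\le \|\Psi\Delta\|_{2,\infty}
+ \max_{1 \le j \le M} \left[ \sum_{t=1}^{T} \Big( \sum_{k=1,\, k \ne j}^{M} \Psi_{tj,tk} \Delta_{tk} \Big)^2 \right]^{1/2} \\
&\le \|\Psi\Delta\|_{2,\infty}
+ \frac{1}{7\alpha s} \max_{1 \le j \le M} \sum_{k=1,\, k \ne j}^{M} \Big( \sum_{t=1}^{T} \Delta_{tk}^2 \Big)^{1/2} \\
&\le \|\Psi\Delta\|_{2,\infty} + \frac{\|\Delta\|_{2,1}}{7\alpha s}.
\end{aligned}
$$

Thus, by Lemma 3.1 and Theorem 3.1, with probability at least $1 - M^{1-q}$,

$$
\frac{1}{\sqrt{T}} \|\Delta\|_{2,\infty}
\le \left( 3 + \frac{32}{7\alpha\kappa^2} \right) \frac{\sigma}{\sqrt{n}}
\sqrt{1 + \frac{A \log M}{\sqrt{T}}}.
$$

By Lemma 4.1, $\alpha\kappa^2 = \alpha - 1$, which yields the first result of the theorem. The
second result follows from the first one in an obvious way. ■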
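
The selector of Theorem 4.1 keeps the variables whose normalized norm $\|\hat\beta^j\| / \sqrt{T}$ exceeds a threshold of the form $(c/\sqrt{n}) \sqrt{1 + A \log M / \sqrt{T}}$ with $c = (3 + 32/(7(\alpha - 1)))\sigma$. A minimal sketch of this thresholding step follows; the function names, the storage of $\hat\beta$ as $T$ rows of length $M$, and the numerical values are assumptions for illustration, and in practice $\sigma$ and $\alpha$ would have to be known or estimated.

```python
import math

def constant_c(alpha, sigma):
    """c = (3 + 32 / (7 (alpha - 1))) * sigma, for alpha > 1."""
    return (3.0 + 32.0 / (7.0 * (alpha - 1.0))) * sigma

def selection_threshold(c, n, T, M, A):
    """(c / sqrt(n)) * sqrt(1 + A log M / sqrt(T))."""
    return c / math.sqrt(n) * math.sqrt(1.0 + A * math.log(M) / math.sqrt(T))

def select_pattern(beta_hat, c, n, A):
    """1-based indices j with ||beta_hat^j|| / sqrt(T) above the threshold."""
    T, M = len(beta_hat), len(beta_hat[0])
    tau = selection_threshold(c, n, T, M, A)
    selected = []
    for j in range(M):
        norm = math.sqrt(sum(beta_hat[t][j] ** 2 for t in range(T)))
        if norm / math.sqrt(T) > tau:
            selected.append(j + 1)
    return selected

# Hypothetical estimate with T = 4 tasks and M = 3 variables: the first variable is
# clearly active, the second is small but nonzero, the third is exactly zero.
beta_hat = [[2.0, 0.1, 0.0],
            [2.0, 0.1, 0.0],
            [2.0, 0.1, 0.0],
            [2.0, 0.1, 0.0]]
c = constant_c(alpha=2.0, sigma=1.0)
print(selection_threshold(c, n=400, T=4, M=3, A=9.0))
print(select_pattern(beta_hat, c, n=400, A=9.0))
```

Note that the second variable has a nonzero estimated norm yet is discarded by the threshold, whereas a selector based on nonvanishing norms would retain it.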



An assumption of type (4.2) is inevitable in the context of selection of sparsity pattern. It
says that the vectors $(\beta^*)^j$ cannot be arbitrarily close to 0 for $j$ in the pattern.
Their norms should be at least somewhat larger than the noise level.

The second result of Theorem 4.1 (selection of sparsity pattern) can be compared with
Bach [2008], Nardi and Rinaldo [2008] who considered the Group Lasso. There are several
differences. First, our estimator $\hat{J}$ is based on thresholding of the norms
$\|\hat\beta^j\|$, while Bach [2008], Nardi and Rinaldo [2008] take instead the set where these
norms do not vanish. In practice, the latter is known to be a poor selector; it typically
overestimates the true sparsity pattern. Second, Bach [2008], Nardi and Rinaldo [2008] consider
specific asymptotic settings, while our result holds for any fixed $n, M, T$. Different kinds
of asymptotics can therefore be obtained as simple corollaries. Finally, note that the
estimator $\hat\beta$ is not necessarily unique. Though Nardi and Rinaldo [2008] does not
discuss this fact, the proof there only shows that there exists a subsequence of solutions
$\hat\beta$ of (2.2) such that the set $\{j : \|\hat\beta^j\| \ne 0\}$ coincides with the
sparsity pattern $J(\beta^*)$ in some specified asymptotics (we note here that the "if and only
if" claim before formula (23) in Nardi and Rinaldo [2008] is not proved). In contrast, the
argument in Theorem 4.1 does not require any analysis of the uniqueness issues, though it is
not excluded that the solution is indeed unique. It guarantees that simultaneously for all
solutions $\hat\beta$ of (2.2) and any fixed $n, M, T$ the correct selection is done with high
probability.

Theorems 3.1 and 4.1 imply the following corollary.



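Corollary 4.1. Consider the model (1.1) for $M \ge 2$ and $T, n \ge 1$. Let the assumptions of
Lemma 3.1 be satisfied and let Assumption 4.1 hold with the same $s$. Let $\lambda$, $A$ and
$W_1, \dots, W_T$ be as in Lemma 3.1. Then with probability at least $1 - M^{1-q}$, where
$q = \min(8 \log M, A\sqrt{T}/8)$, for any solution $\hat\beta$ of problem (2.2) and any
$1 \le p < \infty$ we have

$$
\frac{1}{\sqrt{T}} \|\hat\beta - \beta^*\|_{2,p}
\le c_1 \sigma \frac{s^{1/p}}{\sqrt{n}} \sqrt{1 + \frac{A \log M}{\sqrt{T}}}, \qquad (4.4)
$$

where

$$
c_1 = \left( \frac{32\alpha}{\alpha - 1} \right)^{1/p}
\left( 3 + \frac{32}{7(\alpha - 1)} \right)^{1 - 1/p}.
$$

If, in addition, (4.2) holds, then with the same probability for any solution $\hat\beta$ of
problem (2.2) and any $1 \le p < \infty$ we have

$$
\frac{1}{\sqrt{T}} \|\hat\beta - \beta^*\|_{2,p}
\le c_1 \sigma \frac{|\hat{J}|^{1/p}}{\sqrt{n}} \sqrt{1 + \frac{A \log M}{\sqrt{T}}}, \qquad (4.5)
$$

where $\hat{J}$ is defined in (4.3).

Proof. Set $\Delta = \hat\beta - \beta^*$. For any $p \ge 1$ we have

$$
\frac{1}{\sqrt{T}} \|\Delta\|_{2,p}
\le \left( \frac{1}{\sqrt{T}} \|\Delta\|_{2,1} \right)^{1/p}
\left( \frac{1}{\sqrt{T}} \|\Delta\|_{2,\infty} \right)^{1 - 1/p}.
$$

Combining (3.7), (4.1) with $\kappa = \sqrt{1 - \frac{1}{\alpha}}$ and the above display yields
the first result. ■

Inequalities (4.4) and (4.5) provide confidence intervals for the unknown parameter $\beta^*$
in mixed $(2,p)$-norms.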

For averages of the coefficients $\beta_{tj}$ we can establish a sign consistency result which
is somewhat stronger than the result in Theorem 4.1. For any $\beta \in \mathbb{R}^M$, define
$\mathrm{sign}(\beta) = (\mathrm{sign}(\beta_1), \dots, \mathrm{sign}(\beta_M))^\top$ where

$$
\mathrm{sign}(t) =
\begin{cases}
1 & \text{if } t > 0, \\
0 & \text{if } t = 0, \\
-1 & \text{if } t < 0.
\end{cases}
$$

Introduce the averages

$$
a_j^* = \frac{1}{T} \sum_{t=1}^{T} \beta_{tj}^*, \qquad
\hat{a}_j = \frac{1}{T} \sum_{t=1}^{T} \hat\beta_{tj}.
$$

Consider the threshold $\tau = \frac{c}{\sqrt{n}} \sqrt{1 + \frac{A \log M}{\sqrt{T}}}$ and
define a thresholded estimator

$$ \tilde{a}_j = \hat{a}_j \, I\{ |\hat{a}_j| > \tau \}. $$

Let $\tilde{a}$ and $a^*$ be the vectors with components $\tilde{a}_j$ and $a_j^*$,
$j = 1, \dots, M$, respectively. We need the following additional assumption.

Assumption 4.2. It holds:

$$
\min_{j \in J(a^*)} |a_j^*| > \frac{2c}{\sqrt{n}} \sqrt{1 + \frac{A \log M}{\sqrt{T}}}.
$$

This assumption says that we cannot recover arbitrarily small components. Similar assumptions
are standard in the literature on sign consistency (see, for example, Lounici [2008] for
more details and references).

Theorem 4.2. Consider the model (1.1) for $M \ge 2$ and $T, n \ge 1$. Let the assumptions of
Lemma 3.1 be satisfied and let Assumption 4.1 hold with the same $s$. Let $\lambda$ and $A$ be
defined as in Lemma 3.1 and $c$ as in Theorem 4.1. Then with probability at least
$1 - M^{1-q}$, where $q = \min(8 \log M, A\sqrt{T}/8)$, for any solution $\hat\beta$ of problem
(2.2) we have

$$
\max_{1 \le j \le M} |\hat{a}_j - a_j^*|
\le \frac{c}{\sqrt{n}} \sqrt{1 + \frac{A \log M}{\sqrt{T}}}.
$$

If, in addition, Assumption 4.2 holds, then with the same probability, for any solution
$\hat\beta$ of problem (2.2), $\tilde{a}$ recovers the sign pattern of $a^*$:

$$ \mathrm{sign}(\tilde{a}) = \mathrm{sign}(a^*). $$

Proof. Note that for every $j \in \mathbb{N}_M$,

$$
|\hat{a}_j - a_j^*| \le \frac{1}{\sqrt{T}} \|\hat\beta - \beta^*\|_{2,\infty}
\le \frac{c}{\sqrt{n}} \sqrt{1 + \frac{A \log M}{\sqrt{T}}}.
$$

The proof is then similar to that of Theorem 4.1. ■

We may consider a stronger assumption that $\beta_t^* = a$ for every $t \in \mathbb{N}_T$,
where $a = (a_j : j \in \mathbb{N}_M) \in \mathbb{R}^M$ is an unknown vector to be estimated.
Then Theorem 4.2 implies that $\hat{a}$ is a $\sqrt{n}$-consistent (up to logarithms) estimator
of all the components of $a$ and the sparsity (and sign) pattern of $a$ is correctly recovered
by that of $\tilde{a}$ with overwhelming probability.



5 Non-Gaussian noise 

In this section, we only assume that the random variables $W_{ti}$, $i \in \mathbb{N}_n$,
$t \in \mathbb{N}_T$, are independent with zero mean and finite variance
$\mathbb{E}[W_{ti}^2] \le \sigma^2$. In this case the results remain similar to those of the
previous sections, though the concentration effect is weaker. We need the following
technical assumption.

Assumption 5.1. The matrix $X$ is such that

$$
\frac{1}{nT} \sum_{t=1}^{T} \sum_{i=1}^{n} \max_{1 \le j \le M} |(x_{ti})_j|^2 \le c',
$$

for a constant $c' > 0$.

This assumption is quite mild. It is satisfied, for example, if all $(x_{ti})_j$ are bounded in
absolute value by a constant uniformly in $i, t, j$. We have the two following theorems.

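Theorem 5.1. Consider the model (1.1) for $M \ge 3$ and $T, n \ge 1$. Assume that the random
vectors $W_1, \dots, W_T$ are independent with zero mean and finite variance
$\mathbb{E}[W_{ti}^2] \le \sigma^2$, all diagonal elements of the matrix $X^\top X / n$ are equal
to 1 and $M(\beta^*) \le s$. Let also Assumption 5.1 be satisfied. Furthermore let $\kappa$ be
defined as in Assumption 3.1 and $\phi_{\max}$ be the largest eigenvalue of the matrix
$X^\top X / n$. Let

$$ \lambda = \sigma \sqrt{\frac{(\log M)^{1+\delta}}{nT}}, \qquad \delta > 0. $$

Then with probability at least $1 - \frac{(2e \log M - e)\, c'}{(\log M)^{1+\delta}}$, for any
solution $\hat\beta$ of problem (2.2) we have

$$
\frac{1}{nT} \|X(\hat\beta - \beta^*)\|^2
\le \frac{16}{\kappa^2} \sigma^2 s \frac{(\log M)^{1+\delta}}{n},
$$

$$
\frac{1}{\sqrt{T}} \|\hat\beta - \beta^*\|_{2,1}
\le \frac{16}{\kappa^2} \sigma s \sqrt{\frac{(\log M)^{1+\delta}}{n}},
$$

$$ M(\hat\beta) \le \frac{64 \phi_{\max}}{\kappa^2} s. $$

If, in addition, Assumption RE($2s$) holds, then with the same probability for any solution
$\hat\beta$ of problem (2.2) we have

$$
\frac{1}{\sqrt{T}} \|\hat\beta - \beta^*\|
\le \frac{4\sqrt{10}}{\kappa^2(2s)} \sigma \sqrt{\frac{s (\log M)^{1+\delta}}{n}}.
$$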

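Theorem 5.2. Consider the model (1.1) for $M \ge 3$ and $T, n \ge 1$. Let the assumptions of
Theorem 5.1 be satisfied and let Assumption 4.1 hold with the same $s$. Set

$$ c = \left( \frac{3}{2} + \frac{16}{7(\alpha - 1)} \right) \sigma. $$

Let $\lambda$ be as in Theorem 5.1. Then with probability at least
$1 - \frac{(2e \log M - e)\, c'}{(\log M)^{1+\delta}}$, for any solution $\hat\beta$ of problem
(2.2) we have

$$
\frac{1}{\sqrt{T}} \|\hat\beta - \beta^*\|_{2,\infty}
\le c \sqrt{\frac{(\log M)^{1+\delta}}{n}}.
$$

If, in addition, it holds that

$$
\min_{j \in J(\beta^*)} \frac{1}{\sqrt{T}} \|(\beta^*)^j\|
> 2c \sqrt{\frac{(\log M)^{1+\delta}}{n}},
$$

then with the same probability for any solution $\hat\beta$ of problem (2.2) the set of indices

$$
\hat{J} = \left\{ j : \frac{1}{\sqrt{T}} \|\hat\beta^j\|
> c \sqrt{\frac{(\log M)^{1+\delta}}{n}} \right\}
$$

estimates correctly the sparsity pattern $J(\beta^*)$:

$$ \hat{J} = J(\beta^*). $$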



The proofs of these theorems are similar to the ones of Theorems 3.1 and 4.1 up to a
modification of the bound on $\Pr(\mathcal{A}^c)$ in Lemma 3.1. We consider now the event



A 



M 

max 



i=l \i=l 



The Markov inequality yields that 



Pr(^^) ^ 



(AnT)2 

Then we use Lemma A.2 given below with the random vectors

$$
Y_{ti} = \big( (x_{ti})_1 W_{ti}/n, \dots, (x_{ti})_M W_{ti}/n \big) \in \mathbb{R}^M,
\quad \forall i \in \mathbb{N}_n,\ \forall t \in \mathbb{N}_T.
$$

We get that

T n 



2elogM-e 2 1 



-a 



—yy 

nT ^ ^1 



t=l i=l 



max \{xu)j\ 



By the definition of $\lambda$ in Theorem 5.1 and Assumption 5.1 we obtain

$$ \Pr(\mathcal{A}^c) \le \frac{(2e \log M - e)\, c'}{(\log M)^{1+\delta}}. $$



Thus, we see that under the finite variance assumption on the noise, the dependence on the 
dimension M cannot be made negligible for large T. 



A Auxiliary results 

Here we collect two auxiliary results which are used in the above analysis. The first result is a 
useful bound on the tail of the chi-square distribution. 

Lemma A.1. Let $\chi_T^2$ be a chi-square random variable with $T$ degrees of freedom. Then

$$
\Pr(\chi_T^2 \ge T + x) \le \exp\left( -\frac{1}{8} \min\left( x, \frac{x^2}{T} \right) \right)
$$

for all $x > 0$.

Proof. By the Wallace inequality (Wallace [1959]) we have

$$ \Pr(\chi_T^2 \ge T + x) \le \Pr(\mathcal{N} > z(x)), $$

where $\mathcal{N}$ is the standard normal random variable and
$z(x) = \sqrt{x - T \log(1 + x/T)}$. The result now follows from the inequalities
$\Pr(\mathcal{N} > z(x)) \le \exp(-z^2(x)/2)$ and

$$
u - \log(1 + u) \ge \frac{u^2}{2(1 + u)} \ge \frac{1}{4} \min(u, u^2), \quad \forall u > 0. \qquad \blacksquare
$$
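
As a numerical sanity check of Lemma A.1 (an illustration, not part of the proof), one can compare the bound with the exact chi-square tail. The sketch below relies on the classical closed form $\Pr(\chi^2_{2k} > x) = e^{-x/2} \sum_{i=0}^{k-1} (x/2)^i / i!$, valid for an even number of degrees of freedom; the function names and the grid of test values are assumptions made for this example.

```python
import math

def chi2_sf_even(T, x):
    """Exact Pr(chi^2_T > x) for even T, via the Poisson-sum formula."""
    if T % 2 != 0 or T <= 0:
        raise ValueError("this closed form needs a positive even T")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, T // 2):
        term *= half / i        # term = (x/2)^i / i!
        total += term
    return math.exp(-half) * total

def lemma_a1_bound(T, x):
    """exp(-(1/8) * min(x, x^2 / T)), the right-hand side of Lemma A.1."""
    return math.exp(-min(x, x * x / T) / 8.0)

for T in (2, 10, 50):
    for x in (0.5, 5.0, 50.0):
        exact = chi2_sf_even(T, T + x)
        bound = lemma_a1_bound(T, x)
        print(T, x, exact, bound, exact <= bound)
```

On this grid the exact tail stays below the bound, as the lemma requires; the bound is loose for small $x$ and becomes more informative in the large-deviation regime.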



The next result is a version of Nemirovski's inequality (see Dümbgen et al. [2008],
Corollary 2.4, page 5).

Lemma A.2. Let $Y_1, \dots, Y_n \in \mathbb{R}^M$ be independent random vectors with zero means
and finite variance, and let $M \ge 3$. Then

$$
\mathbb{E}\left[ \Big\| \sum_{i=1}^{n} Y_i \Big\|_\infty^2 \right]
\le (2e \log M - e) \sum_{i=1}^{n} \mathbb{E}\left[ \|Y_i\|_\infty^2 \right],
$$

where $\|\cdot\|_\infty$ is the $\ell_\infty$ norm.